======================================================================================
## [1] "Fri Mar 8 12:53:11 2019"
Install packages / libraries / load data
The raw data has 113,937 rows and 81 columns. It contains information on each loan made by the Prosper Company of San Francisco from Q4-2005 until Q1-2014.
Input data provides specific information about: loan amount, borrower rate (or interest rate), borrower APR, current loan status, borrower income, borrower employment status and duration, borrower credit history, and the latest payment information.
## [1] 113937 81
Based on goals of this analysis 54 of the original columns were retained and the rest dropped. Where possible, the remaining column names were shortened.
The table below is a six-statistic summary of the ‘BorrowerAPR’ column.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00653 0.15629 0.20976 0.21883 0.28381 0.51229 25
The following four barplots of the ‘BorrowerAPR’ column show frequency on the y axis and Borrower’s APR on the x axis.
The first plot has a binwidth of 0.05.
In the next three plots the binwidth was gradually decreased. Smaller binwidths show the extent to which Prosper uses fine increments in interest rates.
A high frequency of loans is made at a rate of roughly 0.365 (36.5 percent); however, this is not visible when the binwidth is 0.05.
With a binwidth of 0.01 rather than 0.05, we can now see the hidden spike in loan frequency at roughly the 0.365 rate.
Decreasing the binwidth even further shows that the increments between the interest rates charged by the Prosper company are too fine to seem feasible.
My conclusion is that these interest rates were randomly generated by the individual or individuals providing the data and are not true interest rates charged by banks: there are 6677 different rates, most of them differing by only 0.00001, which is extremely unlikely.
Had I not explored using different binwidths I probably would not have discovered this.
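The binwidth comparison above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the `apr` vector is made-up data standing in for the ‘BorrowerAPR’ column, and `bin_counts` is a hypothetical helper, not the code used to draw the plots.

```r
# Made-up values standing in for the BorrowerAPR column (assumption).
apr <- c(0.312, 0.323, 0.334, 0.345, 0.3651, 0.3651, 0.3652, 0.3653, NA)
apr <- apr[!is.na(apr)]

# Number of distinct rates and the smallest gap between neighbouring rates.
distinct_rates <- sort(unique(apr))
n_distinct <- length(distinct_rates)
min_gap <- min(diff(distinct_rates))

# Hypothetical helper: count observations per bin, labelled by bin index
# (bin index k covers the interval [k * binwidth, (k + 1) * binwidth)).
bin_counts <- function(x, binwidth) {
  table(floor(x / binwidth))
}

coarse <- bin_counts(apr, 0.05)  # wide bins: the two bins look alike
fine   <- bin_counts(apr, 0.01)  # narrow bins: one bin stands out

print(n_distinct)
print(min_gap)
print(coarse)
print(fine)
```

With the wide bins the counts are flat, while the narrow bins isolate the cluster near 0.365. The same `unique` and `diff` calls applied to the real column would give the number of distinct rates and the typical spacing between them.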
The six-statistic summary of ‘BorrowerRate’ (interest rate) below shows that three quarters of the interest rates lie between 0.134 (the first quartile) and 0.4975 (the maximum).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1340 0.1840 0.1928 0.2500 0.4975
The plot of ‘BorrowerRate’ below uses a binwidth of 0.001 to increase the granularity visible in the dispersion of interest rate values.
The next plot below displays the ‘BorrowerRate’ distribution as a density for comparison to the bar plot above.
This density curve shows the relative percentage of total on the y axis rather than count as in the plot above.
The concentration of loans around the rate of 0.14 and the spike in frequency at the 0.325 rate level are noticeable.
The six-statistic summary of ‘LoanTerm’ below indicates a median loan term of 36 months, a minimum of 12 months, and a maximum of 60 months.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.00 36.00 36.00 40.83 36.00 60.00
The histogram of ‘LoanTerm’ below shows three loan term lengths (12, 36, and 60 months), the most frequent being 36 months, which visually repeats the summary statistics above.
The table of ‘BorrowerState’ below shows counts of loans made in each state.
##
##            AK    AL    AR    AZ    CA    CO    CT    DC    DE    FL    GA
##  5515   200  1679   855  1901 14717  2210  1627   382   300  6720  5008
##    HI    IA    ID    IL    IN    KS    KY    LA    MA    MD    ME    MI
##   409   186   599  5921  2078  1062   983   954  2242  2821   101  3593
##    MN    MO    MS    MT    NC    ND    NE    NH    NJ    NM    NV    NY
##  2318  2615   787   330  3084    52   674   551  3097   472  1090  6729
##    OH    OK    OR    PA    RI    SC    SD    TN    TX    UT    VA    VT
##  4197   971  1817  2972   435  1122   189  1737  6842   877  3278   207
##    WA    WI    WV    WY
##  3048  1842   391   150
The following barplot shows the count of loans on the y axis and lists each state on the x axis.
The first bar in the lower left of the graph shows over 5 thousand borrowers had not indicated which state they were located in.
The following table of ‘EmploymentStatus’ gives counts for each employment group.
##                    Employed     Full-time Not available  Not employed
##          2255         67322         26355          5347           835
##         Other     Part-time       Retired Self-employed
##          3806          1088           795          6134
The following barchart helps visualize the counts of the different employment groups Prosper is making loans to.
Columns 1, 4 & 6 (the blank label, ‘Not available’ and ‘Other’) contain 11,408 loans without a clearly defined employment type.
Perhaps these loans weren’t made to individuals but to some kind of business or organization.
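The 11,408 figure can be reproduced from the counts in the table above. A minimal sketch: the counts are copied from the table, and the name "(blank)" is my own placeholder for the unlabelled first column.

```r
# Counts copied from the EmploymentStatus table; "(blank)" is a placeholder
# name for the column whose label is empty in the output.
employment_counts <- c(
  "(blank)"       = 2255,
  "Employed"      = 67322,
  "Full-time"     = 26355,
  "Not available" = 5347,
  "Not employed"  = 835,
  "Other"         = 3806,
  "Part-time"     = 1088,
  "Retired"       = 795,
  "Self-employed" = 6134
)

# Columns 1, 4 and 6 have no clearly defined employment type.
unclear_total <- sum(employment_counts[c(1, 4, 6)])
total_loans   <- sum(employment_counts)

print(unclear_total)
print(total_loans)
```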
The table of ‘HomeMortgage’ below tells us the number of borrowers who have a home mortgage and the number who don’t.
It’s clear from these counts that Prosper makes loans to those with and without a home mortgage at about the same frequency.
## False True
## 56459 57478
The following table of ‘IncomeRange’ shows the counts of borrowers in each grouping.
Two of the values (‘Not displayed’ and ‘Not employed’) don’t represent income ranges; together they account for 8,547 loans.
The columns are not in a sensible order, such as increasing from left to right.
## $0 $1-24,999 $100,000+ $25,000-49,999 $50,000-74,999
## 621 7274 17337 32192 31050
## $75,000-99,999 Not displayed Not employed
## 16916 7741 806
After converting the ‘IncomeRange’ variable into an ordered factor we re-plot it to create a more accurate visualization of the actual income distribution for the rows containing valid income ranges.
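A minimal sketch of that conversion in base R follows. The counts are copied from the ‘IncomeRange’ table; rebuilding the column with `rep` and the variable names used here are assumptions made for illustration, not the original code.

```r
# Counts copied from the IncomeRange table above.
income_counts <- c(
  "$0" = 621, "$1-24,999" = 7274, "$100,000+" = 17337,
  "$25,000-49,999" = 32192, "$50,000-74,999" = 31050,
  "$75,000-99,999" = 16916, "Not displayed" = 7741, "Not employed" = 806
)

# Rebuild a column of labels from the counts (stand-in for the real column).
income_range <- rep(names(income_counts), times = income_counts)

# Levels in ascending monetary order; labels that are not income ranges
# are left out of the levels and therefore become NA.
range_levels <- c("$0", "$1-24,999", "$25,000-49,999",
                  "$50,000-74,999", "$75,000-99,999", "$100,000+")
income_range_ordered <- factor(income_range, levels = range_levels,
                               ordered = TRUE)

print(table(income_range_ordered))
print(sum(is.na(income_range_ordered)))
```

Leaving ‘Not displayed’ and ‘Not employed’ out of `levels` is one way to restrict the plot to valid income ranges; keeping them as extra levels at the end would be an equally reasonable choice.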
The table below divides ‘IncomeVerifiable’ into counts of yes (True) and no (False) values.
## False True
## 8669 105268
The barplot of ‘IncomeVerifiable’ below shows the same difference in yes and no counts visually.
A visual comparison seems to provide a more meaningful message.
The following table of ‘LoanOriginationQuarter’ is an unordered count of loans made by the Prosper company during each quarter.
##
## Q1 Q2 Q3 Q4
## 29678 24906 27967 31386
The following plot shows the count of loans made by Prosper each quarter.
The table below provides the number of loans originated by year.
##
## 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
## 22 5906 11460 11552 2047 5652 11228 19553 34345 12172
The following barplot of “OriginationYear” displays counts of loans originated yearly by Prosper.
Added a new column called ‘LoanCode’ to the data frame uni_data that replaces the loan category values, which are currently integers, with human-readable strings.
After producing a summary table of ‘LoanCategory’, I changed the data type of ‘LoanCategory’ from integer to factor, creating a different table showing the counts for each loan category.
## Not-Available Debt-Consolidation Home-Improvement
## 16965 58308 7433
## Business Personal-Loan Student Use
## 7189 2395 756
## Auto Other Baby-Adoption
## 2572 10494 199
## Boat Cosmetic-Procedure Engagement-Ring
## 85 91 217
## Green Loans Household-Expenses Large Purchases
## 59 1996 876
## Medical-/-Dental Motorcycle RV
## 1522 304 52
## Taxes Vacation Wedding Loans
## 885 768 771
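The integer-to-label recoding behind this table might look like the sketch below. The labels are taken from the table in the order shown; that the underlying integer codes run from 0 to 20 in that same order is an assumption, as is the small example vector.

```r
# Labels in the order they appear in the table above.
loan_labels <- c(
  "Not-Available", "Debt-Consolidation", "Home-Improvement",
  "Business", "Personal-Loan", "Student Use",
  "Auto", "Other", "Baby-Adoption",
  "Boat", "Cosmetic-Procedure", "Engagement-Ring",
  "Green Loans", "Household-Expenses", "Large Purchases",
  "Medical-/-Dental", "Motorcycle", "RV",
  "Taxes", "Vacation", "Wedding Loans"
)

# Made-up integer codes standing in for the LoanCategory column.
loan_category <- c(1L, 0L, 7L, 1L, 20L)

# Assumption: code 0 is the first label, code 20 the last.
loan_code <- factor(loan_category, levels = 0:20, labels = loan_labels)

print(as.character(loan_code))
print(table(loan_code)[c("Debt-Consolidation", "Other")])
```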
The bar plot below shows the ‘LoanCode’ counts for the categories of loans the Prosper company makes (20 named categories plus ‘Not-Available’), debt consolidation loans being the most frequent.
The table below summarizes the counts of the eight credit grades.
##           A    AA     B     C     D     E    HR    NC
## 84984  3315  3509  4389  5649  5153  3289  3508   141
The bar plot below displays the distribution of the eight ‘CreditGrades’, along with a group of roughly 85K loans that have no grade.
These credit grades only apply to loans made prior to 2009; this variable is therefore not consistent across the data set and should be used with caution in any calculations.
## Cancelled Chargedoff Completed
## 5 11992 38074
## Current Defaulted FinalPaymentInProgress
## 56576 5018 205
## Past Due (>120 days) Past Due (1-15 days) Past Due (16-30 days)
## 16 806 265
## Past Due (31-60 days) Past Due (61-90 days) Past Due (91-120 days)
## 363 313 304
The barchart below shows the loan frequencies of each ‘LoanStatus’ category.
The table below summarizes the number of CurrentDelinquencies.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.0000 0.0000 0.5921 0.0000 83.0000 697
The histogram below shows the frequency of the ‘LoanDaysDelinquent’ variable.
The following cell displays the results of the summary function, producing six-statistic tables of CreditScoreRangeLower and CreditScoreRangeUpper.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 660.0 680.0 685.6 720.0 880.0 591
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 19.0 679.0 699.0 704.6 739.0 899.0 591
Added a ‘CreditScoreMean’ column by calculating the mean of ‘CreditScoreRangeLower’ and ‘CreditScoreRangeUpper’. The following plot shows the mean of the added CreditScoreMean column as a dark red dashed vertical line.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 10.0 670.0 690.0 695.6 730.0 890.0 591
## Mean
## 695.5677
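One way the ‘CreditScoreMean’ column could be derived is sketched below, assuming it is the row-wise average of the two range columns. The three score pairs are illustrative values. The summary above shows whole-number quartiles, so the original may also have rounded the result; that step is not reproduced here.

```r
# Illustrative rows; NA mimics borrowers with no credit score on file.
scores <- data.frame(
  CreditScoreRangeLower = c(660, 680, 720, NA),
  CreditScoreRangeUpper = c(679, 699, 739, NA)
)

# Row-wise mean of the lower and upper bounds.
scores$CreditScoreMean <- rowMeans(
  scores[, c("CreditScoreRangeLower", "CreditScoreRangeUpper")]
)

# Overall mean, ignoring missing scores (the value a vertical line would mark).
overall_mean <- mean(scores$CreditScoreMean, na.rm = TRUE)

print(scores)
print(overall_mean)
```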
## [1] "Fri Mar 8 12:53:58 2019"
The initial raw data set contained 113,937 (observations / rows), with 81 (variables / columns), on each loan, such as: loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information, to name a few.
In the early stage of this analysis twenty-seven columns were dropped, bringing the number of columns at this stage of the analysis to 54.
The cell below describes the structure of the cleaned data frame that was used in the univariate analysis.
## [1] 113937 59
Features relating to the five C’s of Credit analysis: (capacity, capital, conditions, character & collateral).
Because most of the loans Prosper makes are consolidation loans and not home loans, variables representing collateral are not present in this data.
Supporting features of interest
Columns 26 & 27 were used to create the CreditScoreMean column (column 55 in uni_data) so that the standard deviation of CreditScoreMean could be calculated, enabling analysis of the spread in CreditScoreMeans as they relate to the 5 C’s.
Added the ‘LoanCode’ column, changing the loan categories from numeric values to string values in order to generate meaningful axis labels on the plots. This brought the total column count up to 56.
Added ‘OriginationQuarter’ & ‘OriginationYear’ columns by extracting two strings from the ‘LoanOriginationQuarter’ column, bringing the column count up to 58.
Added ‘IncomeRange_ordered’ to rearrange the values in the ‘IncomeRange’ column in ascending order for those levels that have monetary values, bringing the column count up to 59.
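The quarter and year extraction described in this list can be sketched as follows. It assumes the source strings have the form "Q1 2014" (quarter, a space, then the year); that format and the example values are assumptions, not taken from the output above.

```r
# Assumed format of the quarter strings: "<quarter> <year>".
loan_origination_quarter <- c("Q1 2014", "Q4 2005", "Q3 2012")

# Split each string on the space and pick the two pieces.
parts <- strsplit(loan_origination_quarter, " ", fixed = TRUE)
origination_quarter <- vapply(parts, `[`, character(1), 1)
origination_year    <- as.integer(vapply(parts, `[`, character(1), 2))

print(origination_quarter)
print(origination_year)
```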
The BorrowerAPR was very unusual in that, although there were over 113,000 loans in this data set, there were 6677 different interest rates, many separated by a difference of just 0.00001.
It would be unrealistic for a bank to apply this many different rates to its pool of borrowers. Without performing this exploratory analysis I wouldn’t have stumbled upon this finding.
Columns Modified
Used the year function from the lubridate package to capture the date field, which enabled plotting of the data in the “LoanOriginationDate” column.
This transformation was done to enable plotting of time data.
Ordered factor variables in the ‘IncomeRange’ column improving plot readability.
Rotated the x-axis tick labels in three plots to improve the display of those labels:
EmploymentStatus
IncomeRange
LoanCode (replaced the LoanCategory column for plotting this data)
Columns dropped:
columns 1, 2 (listing numbers aren’t any of the 5 C’s of loan risk analysis: Capital, Capacity, Conditions, Character or Collateral)
columns 10:16 (these columns pertain to business efficiency and profitability but give little information on any of the 5 C’s of loan risk analysis)
columns 23:24 (insufficient information is available regarding these values)
columns 39:40 (relate to public records but we don’t know the significance here)
column 46 (not sure what kinds of Trades these are and how they relate to the 5 C’s)
columns 52:58 (each column had roughly 90K NAs or missing values, so only about 20% of the observations provide information on this feature)
columns 59 & 61 (each column has over 95K NAs, same as above)
columns 69:72 (columns relate to costs and profitability of Prosper’s business operations without giving information relevant to the 5 C’s)
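The column indices listed above add up to the twenty-seven dropped columns. A minimal sketch of dropping them by position, using an empty 81-column placeholder data frame rather than the real data:

```r
# Indices of the dropped columns, as listed above.
drop_idx <- c(1:2, 10:16, 23:24, 39:40, 46, 52:58, 59, 61, 69:72)

# Placeholder data frame with the same number of columns as the raw data.
raw <- as.data.frame(matrix(0, nrow = 3, ncol = 81))

# Negative indexing removes the listed columns.
kept <- raw[, -drop_idx]

print(length(drop_idx))
print(ncol(kept))
```

Dropping by name instead of position would be more robust if the column order ever changed, but the positions are what the list above records.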
Rename (copy) uni_data to bi_data for further bivariate analysis
The data frame above was renamed to bi_data and will be used for the bivariate section of analysis.
The plot below is a pairs plot, made with the ggpairs function from the GGally package (an extension of ggplot2), to show whether or not the two variables are correlated.
Bivariate Plot 1 analysis
BorrowerRate vs. CurrentDelinquencies were the two variables in the first bivariate plot.
The results show a correlation of 0.177 indicating a weak correlation.
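For reference, a correlation coefficient of this kind can be computed directly with base R's `cor`. The sketch below uses simulated data, so the value it prints has no connection to the 0.177 reported above; the variable names merely mirror the columns, and Pearson correlation on complete pairs is assumed.

```r
set.seed(1)

# Simulated stand-ins for the two columns (not the Prosper data).
borrower_rate <- runif(200, min = 0.05, max = 0.35)
current_delinquencies <- rpois(200, lambda = 0.2 + 2 * borrower_rate)
current_delinquencies[c(5, 50)] <- NA  # a few missing values, as in the data

# Pearson correlation using only rows where both values are present.
r <- cor(borrower_rate, current_delinquencies, use = "complete.obs")

print(r)
```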
Bivariate plot 2 below uses the ggpairs function to display the correlation between “LoanDurationMonths” and “CurrentDelinquencies”, applying a smoothing function and a facet arrangement of the plots, with “CurrentDelinquencies” mapped to the color aesthetic.
Bivariate Plot 2 analysis
LoanDurationMonths vs. CurrentDelinquencies were the two variables plotted in the second bivariate plot, producing a weak correlation coefficient of 0.248.
Bivariate plot 3 below uses the ggpairs function to display the relationship between “EmploymentStatus” and “CurrentDelinquencies”, applying a smoothing function and a facet arrangement of the plots, with “CurrentDelinquencies” mapped to the color aesthetic.
Bivariate Plot 3 analysis
EmploymentStatus vs. CurrentDelinquencies were used in the third bivariate plot. A correlation coefficient was not returned with this plot because EmploymentStatus is non-numeric.
Bivariate plot 4 below uses the ggpairs function to display the relationship between “IncomeVerifiable” and “CurrentDelinquencies”, applying a smoothing function and a facet arrangement of the plots, with “CurrentDelinquencies” mapped to the color aesthetic.
Bivariate Plot 4 analysis
IncomeVerifiable vs. CurrentDelinquencies were plotted with the pairs plot function. Although the number of delinquencies is lower for borrowers with non-verifiable income, this is most likely because lenders simply do not make as many loans to entities without knowing in advance whether the borrower is capable of repaying the loan.
If an entity had extra cash lying around it probably wouldn’t need a loan. So although verified income might not be useful in determining which loans will be repaid, it can help in the decision to make the loan in the first place.
Bivariate plot 5 below uses the ggpairs function to display the relationship between “IncomeRange_ordered” and “CurrentDelinquencies”, applying a smoothing function and a facet arrangement of the plots, with “CurrentDelinquencies” mapped to the color aesthetic.
Bivariate Plot 5 analysis
IncomeRange_ordered vs. CurrentDelinquencies were plotted in the 5th bivariate plot. This plot visibly shows that the higher the income range, the lower the CurrentDelinquencies rate.
Bivariate plot 6 below uses ggpairs to display the correlation between “OverdueLast7Years” and “CurrentDelinquencies”, applying a smoothing function and facet arrangement, with “CurrentDelinquencies” mapped to the color aesthetic.
Bivariate Plot 6 analysis
OverdueLast7Years vs. CurrentDelinquencies were plotted in the sixth bivariate plot, which returned a correlation coefficient of 0.378, indicating a significant relationship between these two variables.
Bivariate plot 7 below is a pairs plot displaying the correlation between “CreditScoreMean” and “CurrentDelinquencies”, applying a smoothing function and facet arrangement, with “CurrentDelinquencies” mapped to the color aesthetic.
Bivariate Plot 7 analysis
CreditScoreMean vs. CurrentDelinquencies were plotted in the seventh bivariate plot.
It turns out that there is a significant negative correlation between them, calculated to be -0.368.
In other words, as the ‘CreditScoreMean’ increases, the ‘CurrentDelinquencies’ rate decreases.
This explains why a Credit Score is such a key component of loan risk analysis.
Bivariate plot 8 below is a pairs plot displaying the correlation between “FriendsAmountInvested” and “CurrentDelinquencies”, applying a smoothing function and facet arrangement, with “CurrentDelinquencies” mapped to the color aesthetic.
Bivariate Plot 8 analysis
‘FriendsAmountInvested’ vs. ‘CurrentDelinquencies’ were calculated to have a correlation coefficient of 0.0153, which is not considered significant for this relationship.
Bivariate plot 9 below is a pairs plot displaying the correlation between “TotalInquiries” and “Investors”, applying a smoothing function and facet arrangement, with “Investors” mapped to the color aesthetic.
Bivariate Plot 9 analysis
‘TotalInquiries’ vs. ‘Investors’ were plotted using the ggpairs pairs plot function and returned a correlation coefficient of 0.0263, meaning little correlation is present between these two variables.
Bivariate plot 10 below is a pairs plot displaying the correlation between “FriendsAmountInvested” and “DebtToIncomeRatio”, applying a smoothing function and facet arrangement, with “FriendsAmountInvested” mapped to the color aesthetic as in earlier plots.
Bivariate Plot 10 analysis
‘FriendsAmountInvested’ vs. ‘DebtToIncomeRatio’ were plotted to visually display the relationship between these two variables. The correlation coefficient of 0.0279 does not indicate any significant correlation between the two.
Bivariate plot 11 below is another pairs plot displaying the correlation between “CreditScoreLower” and “LoanOriginalAmount”, applying a smoothing function and facet arrangement, with “CreditScoreLower” mapped to the color aesthetic as was done in earlier plots.
Bivariate Plot 11 analysis
‘CreditScoreLower’ vs. ‘LoanOriginalAmount’ were the two variables used in this bivariate plot, producing a correlation coefficient of 0.341, meaning the relationship between these two variables is at the lower end of the significant level.
Bivariate Plot Analysis Summary
The three variables with the highest correlation coefficients were ‘OverdueLast7Years’ at 0.378, then ‘CreditScoreMean’ at -0.368, and finally ‘CreditScoreLower’ at 0.341.
The variable with the next highest correlation was ‘FriendsAmountInvested’.
Analyzing these results, we can conclude that a person’s past credit performance, along with some influence from friends’ support, might provide many of the characteristics that help predict future loan outcomes.
The bivariate plot of ‘CreditScoreLower’ vs. ‘LoanOriginalAmount’ gives a correlation coefficient of 0.341, one of the three strongest correlations discovered thus far.
The strongest relationship between two variables I found thus far was between the ‘OverdueLast7Years’ and ‘CurrentDelinquencies’ columns. The correlation coefficient was calculated to be 0.378.
Renamed (copied bi_data set) to multi_data for further multivariate analysis
Eleven multivariate plots considering different relationships between three or more variables will follow.
Each of these variables was analyzed above to determine its relationship with one other variable (a bivariate analysis).
I’ll be using colors, sizes and shapes to show how these additional variables relate to each of the first two variables.
In the following plots I’ll be looking at relationships between: CreditScoreMean, CurrentDelinquencies, HomeMortgage, IncomeVerifiable, Term, LoanCode, DebtToIncomeRatio, BorrowerAPR, IncomeRange_ordered, InvestorsFriendCount, FriendsAmountInvested & Occupation.
Multivariate_Plot_a compares ‘CurrentDelinquencies’ along the x axis with ‘CreditScoreMean’ along the y axis.
The value of the HomeMortgage which can either be true or false is coded in color.
In this plot I am looking to see if having a HomeMortgage is related to CurrentDelinquencies and CreditScoreMean.
Because the red points fall mainly lower and farther to the right, borrowers with home mortgages tend to have better CreditScoreMeans and lower variance in the number of loan payment delinquencies.
Multivariate_Plot_b compares ‘CreditScoreMean’ along the x axis with ‘CurrentDelinquencies’ along the y axis.
In this plot however, we want to know if having a verified income with the lender is related to delinquent loan payments and or the borrower’s credit score mean.
The low percentage of red points compared to green points indicates very few loans are made to borrowers without a verified income.
For those loans that have been made to borrowers without a verified income the data suggests that few loans made to this group have the lowest level of payment delinquencies.
Very few of the red points lie along the x axis compared with the green points.
Multivariate_Plot_c compares ‘Term’ along the x axis with ‘CurrentDelinquencies’ along the y axis.
From this plot, it looks as though nearly all loans with a 60 month term are going to borrowers with a home mortgage, and the delinquency rate among those loans is relatively low.
On the other hand, loans with a 12 month term appear to be almost evenly distributed among the borrowers who have a home mortgage and those who don’t.
However, for the most prevalent loan term (36 months), the data suggests that payment delinquencies by non mortgage holders are far more numerous than payment delinquencies by borrowers who have a home mortgage.
Moreover, current delinquencies (red points) are more prevalent in the 36 month term category, as can be seen in the higher relative number of red points to green points.
Multivariate_Plot_d, compares ‘CurrentDelinquencies’ along the x axis with ‘LoanCode’ along the y axis.
I’ve added ‘IncomeRange_ordered’ using color visualizing the relationship of ‘IncomeRange_ordered’ with loan purpose listed as ‘LoanCode’ and current payment delinquencies listed as ‘CurrentDelinquencies’.
The conclusion I draw from this plot is that current payment delinquencies are fairly evenly distributed across the income groupings. Other than that, a few categories of loans, such as “Green Loans” and “Boat Loans”, have fewer payment delinquencies than most of the others; however, this is most likely due to fewer of these loans being made.
Multivariate_Plot_e compares ‘DebtToIncomeRatio’ along the x axis with ‘CurrentDelinquencies’ along the y axis.
‘IncomeRange_ordered’ is shown with color to stratify the current payment delinquencies by income group. An initial observation is that this lender is reluctant to approve loans above a DTI of roughly 35%, although some loans are made over that level.
It appears that the number of current delinquencies is fairly evenly distributed among the various income range groupings.
Multivariate_Plot_f, compares ‘BorrowerAPR’ along the x axis with ‘CurrentDelinquencies’ along the y axis.
The plot indicates that current delinquencies are approximately normally distributed for the 36 month loan term, shown in green.
It is less obvious what the distribution is for the 12, and 60 month loan terms.
Multivariate_Plot_g, compares ‘BorrowerAPR’ along the x axis with ‘DebtToIncomeRatio’ along the y axis.
To obtain a better visualization, the upper 1% of values (outliers) were removed by subsetting the current data.
From this plot we can confirm that the mean ‘BorrowerAPR’ is roughly .22 and that the mean DTI is roughly .28.
We can also see that the distribution of each variable plotted against the other is normally shaped. Finally, the distribution with regard to ‘Term’ appears to be nearly identical for each of the three term lengths.
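Removing the upper 1% of values can be done with `quantile` and `subset`. The sketch below is one possible way to do it, on a made-up data frame with a single extreme debt-to-income value; which variable was actually trimmed in the plot is an assumption.

```r
# Made-up data: 99 ordinary DTI values plus one extreme outlier.
loans <- data.frame(
  BorrowerAPR       = seq(0.10, 0.35, length.out = 100),
  DebtToIncomeRatio = c(seq(0.05, 0.60, length.out = 99), 10.01)
)

# 99th percentile of DTI, ignoring missing values.
cutoff <- quantile(loans$DebtToIncomeRatio, probs = 0.99, na.rm = TRUE)

# Keep only rows at or below the cutoff (drops the upper 1%).
trimmed <- subset(loans, DebtToIncomeRatio <= cutoff)

print(cutoff)
print(nrow(trimmed))
```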
Multivariate_Plot_h, compares ‘CurrentDelinquencies’ along the x axis with ‘IncomeRange_ordered’ along the y axis using color for the ‘Term’ variable.
This plot shows that the number of current delinquencies is fairly evenly distributed between the upper four income ranges.
Taking a second look at the summary for the ‘IncomeRange_ordered’ variable, we see that only 621 loans were made to borrowers in the lowest income category and that only 7274 loans out of more than 110,000 were made to borrowers in the $1-24,999 income range, which is about 6 percent of total loans.
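The share of loans in the $1-24,999 range can be checked from the counts reported earlier (7274 loans in that range, 113,937 loans in total):

```r
loans_in_range <- 7274    # count for the $1-24,999 income range
total_loans    <- 113937  # total rows in the data set

share         <- loans_in_range / total_loans  # as a proportion
share_percent <- round(100 * share, 1)         # as a percentage

print(share)
print(share_percent)
```

The proportion is roughly 0.064, which corresponds to about 6.4 percent of all loans.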
Multivariate_Plot_i compares ‘CurrentDelinquencies’ along the x axis with ‘FriendsAmountInvested’ along the y axis, using color for the ‘CreditScoreMean’ variable.
As ‘FriendsAmountInvested’ increases, ‘CurrentDelinquencies’ decrease.
This is perhaps one of the fundamental principles of how peer funding works.
By having peers involved in the loan process, on-time repayments increase.
Multivariate_Plot_j compares ‘CreditScoreMean’ along the x axis with ‘FriendsAmountInvested’ along the y axis, using color for the ‘FriendsAmountInvested’ variable.
The frequency of ‘FriendsAmountInvested’ values appears to show no difference between borrowers with and without a home mortgage.
Multivariate_Plot_k, compares ‘CurrentDelinquencies’ along the x axis with ‘Occupation’ along the y axis using color for the ‘HomeMortgage’ variable.
From this plot we can visually see that the borrowers without a home mortgage appear to have a greater frequency of current delinquencies compared to the borrowers who have a home mortgage.
This is evidenced by the green points coalescing on the left along the y axis and the red points gravitating toward the right side along the y axis.
Linear Models
The R-squared value of the linear model is displayed below.
## Model R-squared value = 0.2425937
Through this analysis I’ve discovered that the ‘CreditScoreMean’ and ‘CurrentDelinquencies’ are inversely related.
Through the analysis I’ve also discovered that on average borrowers who have a ‘HomeMortgage’ have fewer ‘CurrentDelinquencies’, and that the higher the borrower’s ‘DebtToIncomeRatio’ value, the higher the ‘BorrowerAPR’ / annual effective interest rate.
I created a linear model that attained an R-squared value of 0.243, which isn’t too good.
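For readers who want to see how an R-squared value is obtained from a linear model in R, here is a minimal sketch on simulated data. The formula, the single predictor, and the simulated values are all assumptions; the model actually fitted in this analysis is not shown in the output, and the number printed here is unrelated to the one reported.

```r
set.seed(42)
n <- 500

# Simulated stand-ins: scores around 695, delinquencies falling as score rises.
credit_score_mean <- round(rnorm(n, mean = 695, sd = 60))
current_delinquencies <- pmax(
  0, round(5 - 0.006 * credit_score_mean + rnorm(n, sd = 1))
)

# Ordinary least squares fit with one predictor.
fit <- lm(current_delinquencies ~ credit_score_mean)

# R-squared: share of variance in the response explained by the model.
r_squared <- summary(fit)$r.squared

cat("Model R-squared value =", r_squared, "\n")
```

With a single predictor, R-squared equals the squared Pearson correlation between the two variables, which makes for a quick sanity check.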
Final Plot 1 shows the relationship between the dependent or response variable ‘CurrentDelinquencies’ and the independent variable ‘CreditScoreMean’, which was divided into six levels with the cut function.
A regression line was plotted through the points to show that the two variables are inversely related.
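Dividing a numeric variable into six levels with `cut` might look like the sketch below. Passing `breaks = 6` asks for six equal-width intervals; whether the original used equal-width intervals or explicit break points is not stated, so this is an assumption, as are the example scores.

```r
# Example scores spanning a plausible range (made-up values).
credit_score_mean <- c(540, 610, 650, 690, 710, 750, 800, 880)

# Six equal-width intervals covering the observed range.
score_group <- cut(credit_score_mean, breaks = 6)

print(levels(score_group))
print(table(score_group))
```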
Final Plot 2 shows the relationship between the dependent variable ‘CurrentDelinquencies’ which was plotted on the x axis instead of the y axis for ease of labeling of the tick labels.
Color was added to this plot to show the interrelationship of the ‘CreditScoreMean’ variable. This plot clearly shows that the lower credit scores have the greatest number of payment delinquencies.
Final Plot 3 shows the relationship between ‘LoanCode’ (the loan categories) on the y axis and ‘CurrentDelinquencies’ on the x axis.
The ‘CreditScoreMean’ variable was cut into seven groupings and used interactively in this plot.
This plot demonstrates the major significance of the ‘CreditScoreMean’, or of most credit rating systems in general.
Here we can see that very few delinquencies show up for the lowest credit scores regardless of loan purpose.
This is most likely because lenders are not lending that much to borrowers with very low credit scores.
On the other hand, we can also see that the borrowers with the highest credit scores tend to have the fewest delinquent payments regardless of the purpose of the loan.
We can also see that certain categories of loans are not as prevalent like boat loans and RV loans for example.
Conclusions:
This was a challenging project in a number of ways.
First of all, the data set was rather large in both the number of rows and columns.
I found that one of the greatest challenges for me was to get the coding right for making the plots.
Once I could accomplish that task I began to see the shortcomings in the data and had to start making adjustments and modifications to get the plots to work.
I think learning to utilize the ggplot2 package and a number of other packages made the task much easier than it would have been without them.
Possible improvements for this project would incorporate model predictions and prediction accuracy calculations.
References:
Session Info
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.3
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] xtable_1.8-3 sessioninfo_1.1.1 colorspace_1.4-0
## [4] RColorBrewer_1.1-2 reshape2_1.4.3 reshape_0.8.8
## [7] quantmod_0.4-13 TTR_0.23-4 xts_0.11-2
## [10] zoo_1.8-4 purrr_0.3.1 car_3.0-2
## [13] carData_3.0-2 plotly_4.8.0 leaflet_2.0.2
## [16] lubridate_1.7.4 memisc_0.99.14.12 MASS_7.3-51.1
## [19] lattice_0.20-38 scales_1.0.0 gridExtra_2.3
## [22] ggpubr_0.2 GGally_1.4.0 ggthemes_4.1.0
## [25] ggplot2_3.1.0 testthat_2.0.1 hms_0.4.2
## [28] forcats_0.4.0 tibble_2.0.1 broom_0.5.1
## [31] data.table_1.12.0 dygraphs_1.1.1.6 dplyr_0.8.0.1
## [34] plyr_1.8.4 stringr_1.4.0 magrittr_1.5
## [37] readxl_1.3.0 readr_1.3.1 tidyr_0.8.3
##
## loaded via a namespace (and not attached):
## [1] httr_1.4.0 jsonlite_1.6 viridisLite_0.3.0
## [4] shiny_1.2.0 assertthat_0.2.0 cellranger_1.1.0
## [7] yaml_2.2.0 pillar_1.3.1 backports_1.1.3
## [10] glue_1.3.0 digest_0.6.18 promises_1.0.1
## [13] htmltools_0.3.6 httpuv_1.4.5.1 pkgconfig_2.0.2
## [16] haven_2.1.0 openxlsx_4.1.0 later_0.8.0
## [19] rio_0.5.16 generics_0.0.2 withr_2.1.2
## [22] repr_0.19.2 lazyeval_0.2.1 cli_1.0.1
## [25] crayon_1.3.4 mime_0.6 evaluate_0.13
## [28] nlme_3.1-137 foreign_0.8-71 tools_3.5.2
## [31] munsell_0.5.0 zip_1.0.0 compiler_3.5.2
## [34] rlang_0.3.1 grid_3.5.2 htmlwidgets_1.3
## [37] crosstalk_1.0.0 labeling_0.3 base64enc_0.1-3
## [40] rmarkdown_1.11 codetools_0.2-16 gtable_0.2.0
## [43] abind_1.4-5 curl_3.3 R6_2.4.0
## [46] knitr_1.21 stringi_1.3.1 Rcpp_1.0.0
## [49] tidyselect_0.2.5 xfun_0.5
## [1] "Fri Mar 8 12:57:25 2019"